08_PCA
PCA of Antibiotic Resistance Patterns Across Different Species
To explore shared patterns of antibiotic resistance across bacterial species, we performed a Principal Component Analysis (PCA) on the numeric resistance profiles. This approach reveals the main axes of variation in multi-drug resistance and allows us to visualize whether species cluster according to their resistance signatures.
Load Libraries
library("tidyverse")
library("here")
library("broom")
Load Data
MDR_df <- read.csv(here("data/03_dat_aug_wide.csv"))
Convert Resistance Categories to Numeric
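This conversion is done upstream in 03_augment.qmd, where the categorical resistance values R (resistant), I (intermediate), and S (susceptible) are mapped to the numeric scores 1, 0.5, and 0. The numeric scores allow us to analyze resistance patterns quantitatively across samples.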
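A minimal sketch of such a mapping, assuming the categories are stored as the single-letter strings "R", "I", and "S". The actual code lives in 03_augment.qmd and may differ; the names `resistance_scores` and `to_numeric_resistance` are hypothetical and used only for illustration.

```r
# Hypothetical lookup table: resistance category -> numeric score
resistance_scores <- c(R = 1, I = 0.5, S = 0)

# Map a vector of categories to scores; missing or unknown values become NA
to_numeric_resistance <- function(x) {
  unname(resistance_scores[as.character(x)])
}

to_numeric_resistance(c("R", "S", "I", NA, "R"))
```

A named-vector lookup keeps the scoring scheme in one place, so changing the weight given to intermediate isolates would only require editing `resistance_scores`.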
Run Initial PCA
We perform a PCA using only the antibiotic-resistance columns, scaling the variables to unit variance to ensure equal contribution across antibiotics.
# Antibiotic columns: column 8 through the second-to-last column
antibiotics_cols <- colnames(MDR_df)[8:(ncol(MDR_df) - 1)]
# PCA on the antibiotic columns only, scaled to unit variance
pca_fit <- MDR_df |>
  select(all_of(antibiotics_cols)) |>
  prcomp(scale. = TRUE)
# Attach the PC scores to the original data and plot PC1 against PC2
pca_all_plot <- pca_fit |>
  augment(MDR_df) |>
  ggplot(mapping = aes(x = .fittedPC1,
                       y = .fittedPC2,
                       color = Species)) +
  geom_point(alpha = 0.2) +
  labs(x = "PC1",
       y = "PC2")
pca_all_plot
As seen in the initial PCA plot, the structure of the data is dominated by the large number of E. coli isolates, making it difficult to observe patterns in other species. This imbalance can also bias the principal components themselves, since one species contributes a disproportionate share of the variance. To obtain a clearer, more balanced view of resistance patterns, we downsample each species to the same number of isolates.
ggsave(here("results/images/08_PCA_all.png"), pca_all_plot)
Saving 7 x 5 in image
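How much of the total variation a PC1 versus PC2 plot captures can be read from the component standard deviations returned by prcomp(). The sketch below uses simulated data; the matrix `toy_scores` and the column names `abx_1` to `abx_4` are made up for illustration and are not the study data.

```r
set.seed(1)  # arbitrary seed, for a reproducible toy example

# Simulated stand-in for a resistance matrix: 50 isolates x 4 antibiotics,
# with scores drawn from {0, 0.5, 1}
toy_scores <- matrix(sample(c(0, 0.5, 1), 200, replace = TRUE),
                     ncol = 4,
                     dimnames = list(NULL, paste0("abx_", 1:4)))

toy_pca <- prcomp(toy_scores, scale. = TRUE)

# Proportion of variance explained by each principal component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
round(var_explained, 3)

# Share of the total variance visible in a PC1-vs-PC2 scatter plot
sum(var_explained[1:2])
```

With broom loaded, `tidy(pca_fit, matrix = "eigenvalues")` should give the same proportions in its `percent` column. Note also that prcomp() with `scale. = TRUE` stops with an error if any column is constant, so an antibiotic with identical scores in every isolate would have to be dropped first.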
Create Balanced Dataset
We first find the smallest number of isolates recorded for any species and use it as the target group size. Each species is then randomly downsampled to that size, so that highly abundant species no longer dominate the PCA structure.
min_n <- MDR_df |>
  count(Species) |>
  summarise(min(n)) |>
  pull()
balanced_df <- MDR_df |>
  group_by(Species) |>
  slice_sample(n = min_n) |>
  ungroup()
balanced_df |>
  count(Species)
# A tibble: 9 × 2
  Species                     n
  <chr>                   <int>
1 Acinetobacter baumannii   177
2 Citrobacter spp.          177
3 Enterobacteria spp.       177
4 Escherichia coli          177
5 Klebsiella pneumoniae     177
6 Morganella morganii       177
7 Proteus mirabilis         177
8 Pseudomonas aeruginosa    177
9 Serratia marcescens       177
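Because slice_sample() draws isolates at random, the balanced dataset (and therefore the downstream PCA plot) will differ between runs unless a seed is fixed. A minimal sketch on a toy table, assuming dplyr is available; the seed value and the `toy_df` data are arbitrary illustrations and not part of the original analysis.

```r
library(dplyr)

# Toy data: three species with unequal numbers of isolates
toy_df <- data.frame(
  Species = rep(c("A", "B", "C"), times = c(10, 4, 7)),
  isolate = seq_len(21)
)

set.seed(42)  # arbitrary seed so the random draw is repeatable

# The smallest group size becomes the target size for every species
toy_min_n <- toy_df |>
  count(Species) |>
  summarise(min_n = min(n)) |>
  pull(min_n)

toy_balanced <- toy_df |>
  group_by(Species) |>
  slice_sample(n = toy_min_n) |>
  ungroup()

count(toy_balanced, Species)  # every species now has toy_min_n rows
```

Repeating the draw with a few different seeds is one way to check that the conclusions drawn from the plot do not hinge on a single random subsample.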
Run PCA on Balanced Dataset
We repeat the PCA on the balanced dataset.
# Same antibiotic columns and scaling as before, now on the balanced data
pca_fit_balanced <- balanced_df |>
  select(all_of(antibiotics_cols)) |>
  prcomp(scale. = TRUE)
pca_downsampling_plot <- pca_fit_balanced |>
  augment(balanced_df) |>
  ggplot(mapping = aes(x = .fittedPC1,
                       y = .fittedPC2,
                       color = Species)) +
  geom_point() +
  labs(x = "PC1",
       y = "PC2")
pca_downsampling_plot
The PCA of the balanced dataset shows that most species exhibit substantial overlap in their resistance profiles, indicating that many antibiotic-resistance patterns are shared across taxa. However, some species occupy more distinct regions of the PCA space. For instance, Pseudomonas aeruginosa and Serratia marcescens are positioned predominantly toward the right side of the plot, forming more defined clusters that show limited overlap with the left cloud dominated by Escherichia coli. This separation suggests that these species follow characteristic combinations of resistance traits that differ from the broader and more heterogeneous profiles observed in E. coli and other Enterobacterales. Overall, while many species share common multidrug-resistance patterns, a subset displays more distinct resistance signatures that set them apart from the main continuum.
ggsave(here("results/images/08_PCA_downsampling.png"), pca_downsampling_plot)
Saving 7 x 5 in image
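To see which antibiotics drive a separation along PC1, one option is to inspect the loadings, stored in the `rotation` matrix returned by prcomp(). The sketch below uses simulated data in which two made-up groups differ in two hypothetical antibiotics, `abx_1` and `abx_2`; it illustrates the technique only and says nothing about the real isolates.

```r
set.seed(1)  # arbitrary seed, for a reproducible toy example

n_per_group <- 100
group <- rep(c("g1", "g2"), each = n_per_group)

# Probability of resistance to abx_1 and abx_2 depends on the group
p_resistant <- ifelse(group == "g2", 0.95, 0.05)

toy <- data.frame(
  abx_1 = rbinom(length(group), 1, p_resistant),
  abx_2 = rbinom(length(group), 1, p_resistant),
  abx_3 = rbinom(length(group), 1, 0.5),  # unrelated to group
  abx_4 = rbinom(length(group), 1, 0.5)   # unrelated to group
)

toy_pca <- prcomp(toy, scale. = TRUE)

# Loadings: weight of each antibiotic in the first principal component
pc1_loadings <- toy_pca$rotation[, "PC1"]

# Rank antibiotics by the absolute size of their PC1 loading
sort(abs(pc1_loadings), decreasing = TRUE)
```

In this toy setup the two group-dependent antibiotics should carry the largest PC1 loadings. With broom loaded, `tidy(pca_fit_balanced, matrix = "rotation")` should return the loadings of the real fit in long format for the same kind of inspection.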